A METHOD FOR ROBUSTLY DETECTING VOICE ACTIVITY
Background of Invention

Voice activity detection (VAD) techniques are widely used in digital voice
communications to reduce the voice data rate, achieving either spectrally efficient
or power-efficient voice transmission for wireless devices. The essential task of a
VAD algorithm is to distinguish the voice signal from the background noise signal
effectively, which requires exploring multiple signal characteristics such as energy
level, spectral content, periodicity and stationarity. Traditional VAD algorithms
tend to use heuristic approaches that apply a limited subset of these characteristics
to detect voice presence. In practice, the heuristic nature of these techniques makes
it very difficult to achieve both a high voice detection rate and a low false alarm
rate. To address this performance issue, more sophisticated algorithms have been
developed that monitor multiple signal characteristics simultaneously and make a
detection decision based on joint metrics. These algorithms do demonstrate good
performance, but they often lead to complicated implementations or inevitably
become an integrated component of a specific voice encoder algorithm. More
recently, a statistical model based VAD algorithm has been studied and shows good
performance within a simple mathematical framework [1]. The challenge in making
this algorithm practical, however, is to estimate both voice and noise signal power
effectively on each frequency component.

Detailed Description of invention 

The invention disclosed here is a robust statistical model based VAD algorithm that
does not rely on any presumptions about the statistical characteristics of voice and
noise and can quickly train itself to detect the voice signal with good performance.
It is made more attractive by the fact that it works as a stand-alone module and is
independent of the type of voice encoder.

The key advantages of this method are: 

a. Use of a statistical model based approach with proven performance and simplicity.

b. Self-training and adaptation without reliance on any presumptions about the
statistical characteristics of voice and noise.

c. An adaptive detection threshold that lets the algorithm work in any
signal-to-noise ratio (SNR) scenario.

d. A generic stand-alone structure that can work with different voice encoders.

1. Mathematical Framework 

The underlying mathematical framework for the algorithm is the log likelihood
ratio between the event that only noise is present and the event that both voice
and noise are present. It can be formulated mathematically as follows.
Let y(t) = x(t) + n(t) be a frame of the received signal and Y be its corresponding
pre-selected set of complex frequency components. Further, two events are
defined as:

$$H_0: \; Y = N \quad \text{(speech absent)}$$

$$H_1: \; Y = X + N \quad \text{(speech present)}$$

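where X and N are the corresponding pre-selected sets of complex frequency
components of the voice x(t) and the noise n(t) respectively. It is sufficiently
accurate to model Y as a jointly Gaussian distributed random vector with each
individual component as an independent complex Gaussian variable, so that the PDF
of Y conditioned on $H_0$ and $H_1$ can be expressed as:

$$p(Y \mid H_0) = \prod_{k=0}^{L-1} \frac{1}{\pi \lambda_N(k)} \exp\left(-\frac{|Y_k|^2}{\lambda_N(k)}\right)$$

$$p(Y \mid H_1) = \prod_{k=0}^{L-1} \frac{1}{\pi \left(\lambda_X(k) + \lambda_N(k)\right)} \exp\left(-\frac{|Y_k|^2}{\lambda_X(k) + \lambda_N(k)}\right)$$

where $\lambda_X(k)$ and $\lambda_N(k)$ are the variances of the voice complex
frequency component $X_k$ and the noise complex frequency component $N_k$ respectively.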

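Let the log likelihood ratio (LLR) of the kth frequency component be defined as:

$$\log \Lambda_k = \log \frac{p(Y_k \mid H_1)}{p(Y_k \mid H_0)} = \frac{\gamma_k \xi_k}{1 + \xi_k} - \log(1 + \xi_k)$$

where $\xi_k$ and $\gamma_k$ are the so-called a priori signal-to-noise ratio (pri-SNR)
and a posteriori signal-to-noise ratio (post-SNR) respectively, defined as:

$$\xi_k = \frac{\lambda_X(k)}{\lambda_N(k)}, \qquad \gamma_k = \frac{|Y_k|^2}{\lambda_N(k)}$$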

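Then the LLR of the vector Y given $H_0$ and $H_1$, on which the VAD decision is
based, can be expressed as:

$$\log \Lambda = \sum_k \log \Lambda_k = \sum_k \log \frac{p(Y_k \mid H_1)}{p(Y_k \mid H_0)} = \sum_k \left[\frac{\gamma_k \xi_k}{1 + \xi_k} - \log(1 + \xi_k)\right]$$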



An LLR threshold developed from the SNR level can then be used to decide
whether a voice signal is present.
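
As an illustration of the framework above, the frame LLR can be sketched as follows.
This is a minimal sketch only: the function name `frame_llr` and its argument layout
are assumptions made for the example, and it simply sums the per-component LLR terms
that follow from the complex Gaussian model.

```python
import math


def frame_llr(power, noise_var, pri_snr):
    """Sum the per-component log likelihood ratios of one frame.

    power[k]     -- |Y_k|^2 of frequency component k
    noise_var[k] -- noise variance lambda_N(k)
    pri_snr[k]   -- a priori SNR xi_k
    (All three names are illustrative, not taken from the disclosure.)
    """
    total = 0.0
    for p, lam_n, xi in zip(power, noise_var, pri_snr):
        gamma = p / lam_n  # a posteriori SNR of this component
        total += gamma * xi / (1.0 + xi) - math.log1p(xi)
    return total
```

A frame whose components carry little power relative to the noise variance yields a
negative LLR, while a frame with strong components yields a positive one.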
2. Basic Operations

The general flow of the algorithm is illustrated in Figure 1, and each function
block is explained in detail below.

Figure 1 Flow diagram of VAD algorithm

1. For an inbound 5-ms signal frame of 40 samples, a 32-point or 64-point FFT is
performed. If a 32-point FFT is used, the 40-sample frame is truncated to 32
samples. If a 64-point FFT is used, the 40-sample frame is zero-padded to 64 samples.

Note: inbound signal frame size and FFT size can change depending on the 
implementation. 
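
The framing in step 1 can be sketched as follows. The helper names `fit_frame`,
`dft` and `frame_spectrum` are assumptions for this example, and a direct DFT is
used only to keep the sketch free of external libraries; a practical implementation
would use an FFT routine.

```python
import cmath


def fit_frame(frame, nfft):
    """Truncate or zero-pad a frame to exactly nfft samples."""
    if len(frame) >= nfft:
        return list(frame[:nfft])
    return list(frame) + [0.0] * (nfft - len(frame))


def dft(samples):
    """Direct O(n^2) DFT, standing in for the FFT of the text."""
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * cmath.pi * k * i / n) for i, s in enumerate(samples))
        for k in range(n)
    ]


def frame_spectrum(frame, nfft=64):
    """Complex frequency components of one inbound frame."""
    return dft(fit_frame(frame, nfft))
```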

2. From the FFT output, the sum of signal power over the pre-selected frequency set
is calculated and passed through a 1st-order IIR averager to extract the long-term
signal dynamics, as illustrated in Figure 2 and Figure 3. The IIR averager's
forgetting factor is chosen such that the signal's peaks and valleys are preserved.
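
A minimal sketch of step 2 is given here. The names `band_power` and `iir_average`
and the form y[n] = alpha * y[n-1] + (1 - alpha) * x[n] are assumptions for the
example; the text specifies only a 1st-order IIR averager with a forgetting factor.

```python
def band_power(spectrum, bins):
    """Sum of |Y_k|^2 over the pre-selected frequency set `bins`."""
    return sum(abs(spectrum[k]) ** 2 for k in bins)


def iir_average(prev, new, alpha):
    """1st-order IIR averager; alpha close to 1 gives heavier smoothing."""
    return alpha * prev + (1.0 - alpha) * new
```

A smaller `alpha` follows the signal's peaks and valleys more closely, while a
larger one smooths them away, which is the trade-off the forgetting factor controls.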
Claims

This invention disclosure claims the following: 

1) The method to use the statistical model based mathematical formulation
to do VAD.

2) The method to estimate and track voice signal and noise signal power in 
the frequency domain. 

3) The method to establish and adapt the LLR threshold for VAD detection. 
Figure 2 Noise corrupted voice signal




Figure 3 Signal dynamics after IIR averaging 



3. Once the signal power dynamics are available, initial min/max levels can be
established using a simple absolute level detector, based on a pre-configured
min/max signal level gap threshold of, say, 12 dB. Afterwards, a slow 1st-order
averager slowly updates the two levels so that they follow changes in the signal's
dynamics, based on a pre-defined margin value. To build in a high level of system
stability and prevent the min/max gap from collapsing, the min level adaptation is
designed to adapt down more quickly than it adapts up. Similar treatment is applied
to the max level adaptation. If the gap does collapse, the system is reset to
re-establish a valid min/max baseline. Figure 4 illustrates what the min/max levels
look like.
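
One possible reading of step 3 is sketched here. The class `MinMaxTracker`, its
parameter names, the way the margin gates updates, the mirrored behaviour of the max
level (quicker up than down) and the separate collapse threshold are all assumptions
made for illustration; the text does not fix these details.

```python
class MinMaxTracker:
    """Tracks slowly moving min/max levels (in dB) of the smoothed power."""

    def __init__(self, gap_db=12.0, margin_db=3.0, collapse_db=6.0,
                 fast=0.9, slow=0.999):
        self.gap_db = gap_db            # gap needed to establish a baseline
        self.margin_db = margin_db      # only levels near an edge update it
        self.collapse_db = collapse_db  # assumed gap below which we reset
        self.fast, self.slow = fast, slow
        self.reset()

    def reset(self):
        self.lo = self.hi = None
        self.ready = False

    def update(self, level_db):
        if not self.ready:
            # Start-up: absolute level detector until the gap is wide enough.
            self.lo = level_db if self.lo is None else min(self.lo, level_db)
            self.hi = level_db if self.hi is None else max(self.hi, level_db)
            self.ready = (self.hi - self.lo) >= self.gap_db
            return self.ready
        if level_db < self.lo + self.margin_db:
            # Min level: quick to adapt down, slow to adapt up.
            a = self.fast if level_db < self.lo else self.slow
            self.lo = a * self.lo + (1.0 - a) * level_db
        if level_db > self.hi - self.margin_db:
            # Max level: quick to adapt up, slow to adapt down (assumed mirror).
            a = self.fast if level_db > self.hi else self.slow
            self.hi = a * self.hi + (1.0 - a) * level_db
        if self.hi - self.lo < self.collapse_db:
            self.reset()  # gap collapsed: re-establish the baseline
        return self.ready
```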
Figure 4 Establishment and tracking of min/max level



4. Using the slowly moving min/max levels as a baseline, the algorithm
defines one range of signal levels to be considered as noise and another to be
considered as voice, and a 1st-order IIR averager is used to calculate the noise
power and the voice power respectively. The establishment of noise and voice power
is illustrated in Figure 5 and Figure 6.
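
Step 4 might be sketched as follows. The functions `classify` and `update_profiles`,
the use of a dB margin around each baseline level to define the noise and voice
ranges, and the forgetting factor `alpha` are assumptions for this example.

```python
def classify(level_db, lo_db, hi_db, margin_db=3.0):
    """Label a frame from its smoothed level and the min/max baseline."""
    if level_db <= lo_db + margin_db:
        return "noise"
    if level_db >= hi_db - margin_db:
        return "voice"
    return None  # in between: used for neither profile


def update_profiles(kind, power, noise_pow, voice_pow, alpha=0.9):
    """Update the per-component noise or voice power profile in place."""
    if kind == "noise":
        target = noise_pow
    elif kind == "voice":
        target = voice_pow
    else:
        return
    for k, p in enumerate(power):
        target[k] = alpha * target[k] + (1.0 - alpha) * p
```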




Figure 5 Establishment of noise power profile 
Figure 6 Establishment of voice power profile



5. After both noise power and voice power are established, a pri-SNR profile
over the frequency component set can be calculated and tracked, again using a
1st-order IIR averager. The result is shown in Figure 7.
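
A minimal sketch of the pri-SNR tracking in step 5 follows. The function name
`update_pri_snr`, the plain voice-to-noise power ratio used as the instantaneous
estimate, and the small `floor` guarding against division by zero are assumptions
for the example.

```python
def update_pri_snr(pri_snr, voice_pow, noise_pow, alpha=0.9, floor=1e-12):
    """Track the per-component a priori SNR with a 1st-order IIR averager."""
    for k in range(len(pri_snr)):
        # Instantaneous estimate: voice power over noise power (assumed form).
        inst = voice_pow[k] / max(noise_pow[k], floor)
        pri_snr[k] = alpha * pri_snr[k] + (1.0 - alpha) * inst
    return pri_snr
```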




Figure 7 Establishment and tracking of pri-SNR profile 

6. After the pri-SNR profile is available, the corresponding post-SNR profile
and LLR profile can be calculated on a frame-by-frame basis. With the
LLRs available over time and the knowledge of which frames are considered as noise
frames from step 4, an LLR threshold can be established and tracked using a
1st-order IIR averager. The LLR distribution over time and the adaptation of the
LLR threshold are illustrated in Figure 8 and Figure 9.
Note: The frames considered as noise in step 4 are not reliable enough for VAD
purposes, as shown in Figure 9, where some of the LLR values are well above
zero.
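
The threshold adaptation in step 6 could be sketched as follows. The class name
`LlrThreshold`, the choice to initialise the threshold from the first noise frame and
the forgetting factor `alpha` are assumptions for this example.

```python
class LlrThreshold:
    """Tracks an LLR threshold from frames labelled as noise in step 4."""

    def __init__(self, alpha=0.95):
        self.alpha = alpha
        self.value = None  # no threshold until a noise frame has been seen

    def update(self, llr, is_noise_frame):
        if is_noise_frame:
            if self.value is None:
                self.value = llr  # assumed initialisation
            else:
                self.value = self.alpha * self.value + (1.0 - self.alpha) * llr
        return self.value
```

Averaging smooths out the occasional noise-labelled frame whose LLR is well above
zero, so a single mislabelled frame moves the threshold only slightly.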




Figure 8 LLR distribution over time 




Figure 9 LLRs of frames considered as noise and LLR threshold adaptation

7. After the LLR threshold is available, silence detection begins on a
frame-by-frame basis. A frame is considered as silence if its LLR is below the LLR
threshold plus a margin of x dB, and silence suppression is not triggered until a
preset number of consecutive silence frames has been observed (hang-over time).
Figure 10 shows the noise-removed voice signal.
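
The decision logic of step 7 can be sketched as follows. The class name
`SilenceDetector` and its parameters are assumptions for the example, and the margin
is taken to be expressed in the same units as the LLR; the text leaves both the
margin and the hang-over length as configurable values.

```python
class SilenceDetector:
    """Frame-by-frame silence decision with hang-over."""

    def __init__(self, margin, hangover_frames):
        self.margin = margin                    # added to the LLR threshold
        self.hangover_frames = hangover_frames  # consecutive frames required
        self.run = 0                            # current run of silence frames

    def step(self, llr, threshold):
        """Return True when silence suppression applies to this frame."""
        if llr < threshold + self.margin:
            self.run += 1
        else:
            self.run = 0  # any non-silence frame restarts the hang-over count
        return self.run >= self.hangover_frames
```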
Figure 10 Noise suppressed voice signal






